Abstract

Resilient and reliable operation of cyber-physical systems of
societal importance, such as Smart Electric Grids, is one of the
top national priorities. Due to their critical nature, these
systems are equipped with fast-acting, local protection mechanisms.
However, misguided protection actions, which are common, together
with system dynamics can lead to unintentional cascading effects.
This paper describes the ongoing work using Temporal Causal
Diagrams (TCD), a refinement of the Timed Failure Propagation
Graphs (TFPG), to diagnose problems associated with power
transmission lines protected by a combination of relays and
breakers.

The TCD models represent the faults and their propagation as TFPG,
the nominal and faulty behavior of components (including local,
discrete controllers and protection devices) as Timed Discrete
Event Systems (TDES), and capture the cumulative and cascading
effects of these interactions. The TCD diagnosis engine includes an
extended TFPG-like reasoner which, in addition to observing the
alarms and mode changes (as the TFPG reasoner does), monitors the
event traces (that correspond to the behavioral aspects of the
model) to generate hypotheses that consistently explain all the
observations. In this paper, we show the results of applying the
TCD to a segment of a power transmission system that is protected
by distance relays and breakers.

1.  Introduction 

Cyber-Physical Systems (CPS) such as the Smart Electric Grids are
going through transformational reform powered by federal funding
and in line with the stated national energy security mission goals
(Garrity, 2008). These systems work in dynamic environments
resulting from varying load, changing operational requirements and
conditions, physical component degradation, and software failures.
To reach the required level of resiliency and reliability,
efficient online management of CPS is necessary to operate safely
within specified parameters, even in the presence of faults (Ilic
et al., 2005). One aspect of online management is fault
identification, diagnostics, prognostication, and mitigation. The
inability to diagnose and pinpoint the source(s) of failures
automatically and in a timely manner, combined with the potential
side effects of automated protection actions, leads to fault
cascades that could otherwise be avoided (Zhang, Ilic, & Tonguz,
2011; Tholomier, Richards, & Apostolov, 2007). Recent blackouts and
Hurricane Sandy in 2012 demonstrated the vulnerability of the grid
and the need to look at existing defense mechanisms more closely.

Fast-acting localized protection mechanisms are used to arrest the
propagation of failure effects. Electrical protection systems
include detection devices such as fast-acting relays that are
designed to detect abnormal changes in physical properties
(current, voltage, impedance) and actuation devices such as
breakers that can be triggered to open the circuit in electrical
networks. To observe, track, and possibly diagnose these systems,
it is important to consider the discrete and continuous dynamics of
the physical system, the protection systems, and their interactions
in both the nominal and faulty modes of operation. During nominal
(fault-free) operation, both physical and protection systems should
operate nominally to provide the desired functionality. If a fault
appears in the physical system, the nominal protection system is
expected to detect the failure effect and isolate the faulty part
of the system. In some cases, the nominal protection system is
assisted by a set of algorithms to restore the system functionality
to its original configuration once the physical fault disappears
(due to a temporary fault or after repair).

Operators have to consider the possibility of misoperations of
protection systems. Distance relays have been known to incorrectly
initiate tripping due to an apparent impedance that falls into the
Zone settings of line relays as a result of heavy load and
depressed voltage conditions (Pourbeik, Kundur, & Taylor, 2006). In
fact, an investigation by the North American Electric Reliability
Corporation (NERC) demonstrated that nearly all major system
events, excluding those caused by severe weather, have had relay or
automatic control misoperations (almost 2,000 in one year) that
worsened the impact of failure propagation (North American Electric
Reliability Corporation, 2012). Protection malfunction and its
correlation with major blackouts require a careful rethinking of
its system-wide effects (Zhang et al., 2011; Pourbeik et al.,
2006).

This paper describes Temporal Causal Diagrams (TCD), a refinement
of the Timed Failure Propagation Graphs (TFPG) (Abdelwahed, Karsai,
Mahadevan, & Ofsthun, 2009), to diagnose failures of physical
systems that are instrumented with multiple local fast-acting
protection devices and controllers to isolate the faults. The TCD
is a discrete abstraction that captures the causal and temporal
relationships between failure modes (causes) and discrepancies
(effects) in a system, thereby modeling the failure cascades while
taking into account propagation constraints imposed by operating
modes, protection elements, and timing delays. Faults and their
propagation are captured using TFPG models, while the nominal and
faulty operations of the components (controllers, protection
devices, etc.) are captured as Timed Discrete Event Systems (TDES).
We also present a diagnosis reasoner that extends the TFPG
diagnosis algorithm by considering both the alarms and mode changes
(as reported by the physical system) and the various event traces
corresponding to the behavioral aspects of the model. The
uniqueness of the approach is that it does not involve complex
real-time computations on high-fidelity models, but performs
reasoning using efficient graph algorithms based on the observation
of various anomalies and events in the system. When fine-grained
results are needed and computing resources and time are available,
the diagnostic hypotheses can be refined with the help of
physics-based diagnostics.

The paper is organized as follows. Section 2 deals with the related
research. Section 3 describes the temporal causal diagrams. Section
4 documents the results of applying the solution to various fault
scenarios in a power transmission system, and Section 5 concludes
the paper with a discussion of the future work. Notations used and
an overview of Timed Failure Propagation Graphs (TFPG) are
described in the appendices.

2.  Related  Research 

Fault diagnostics has been recognized as a critical task in
electric grid operations (Coster, Myrzik, Kruimer, & Kling, 2011).
A classic but excellent summary of power system fault diagnostics
is provided in (Sekine, Akimoto, Kunugi, Fukui, & Fukui, 2002),
including Bayesian approaches (Mengshoel et al., 2010; Yongli,
Limin, & Jinling, 2006), rule-based reasoning (Melendez et al.,
2004; Lee et al., 2004), expert systems (Talukdar, Cardozo, &
Perry, 2007; Yang, Okamoto, Yokoyama, & Sekine, 1992), fuzzy-logic
methods (W. Chen, Liu, & Tsai, 2000; Sun, Qin, & Song, 2004),
genetic algorithm and search-based techniques (Lin, Ke, Li, Weng, &
Han, 2010), artificial neural networks (Guo et al., 2010; Zhou,
1993), and Petri nets that abstract the power system as a discrete
event system (Sun et al., 2004; Ren, Mi, Zhao, & Yang, 2005).
Problems similar to those of large electric system operations also
occur in smaller systems such as electric ships (Bastos, Zhang,
Srivastava, & Schulz, 2007) and spacecraft (Poll et al., 2007;
Daigle et al., 2010).

A pioneering paper (Fukui & Kawakami, 1986) reports a rule-based or
logic-based system for locating line faults based on real-time
information acquired at the control center of a power system.
(Sekine et al., 2002) compiled a comprehensive survey of the fault
diagnostics systems developed using various knowledge-based system
techniques. Model-based approaches based on the logic behaviors of
the protection devices are identified as valuable tools for fault
analysis. The on-line alarm analyzer reported in (Miao, Sforna, &
Liu, 1996) incorporates the cause-effect principles of protective
devices into logic-based, proof-oriented algorithms for the
analysis of malfunctions. Cause-effect models are used for fault
diagnostics of substations in (W.-H. Chen, Liu, & Tsai, 2000). Upon
field-testing with real-world data, it was found that the proofs
are difficult when uncertainties cannot be resolved. The proof
algorithm in (Miao et al., 1996) had to be generalized in order to
evaluate the credibility of a potentially large number of
hypotheses (W.-H. Chen et al., 2000).

The approach described in this paper differs from existing
practice, where fault analysis and mitigation rely on a logic-based
approach that uses hard thresholds and local information, assisted
by manual system-level analysis. The causal model presented in this
paper is based on the timed failure propagation graph (TFPG)
introduced in (Misra, 1994; Misra, Sztipanovits, & Carnes, 1994),
which is conceptually related to the temporal causal network
approach presented in (Console & Torasso, 1991; Padalkar,
Sztipanovits, Karsai, Miyasaka, & Okuda, 1991; Karsai,
Sztipanovits, Padalkar, & Biegl, 1992; Mosterman & Biswas, 1999).
The TFPG model was extended in (Abdelwahed, Karsai, & Biswas, 2004)
to include mode dependency constraints on the propagation links,
which can then be used to handle failure scenarios in hybrid and
switching systems.

We have extended this work to take into consideration local
mitigation in a subsystem, especially the case in which a
malfunction of protection devices results in a larger fault cascade
leading to a blackout. This is primarily done by considering the
discrete behavior of the protection devices and using it in the
diagnosis. The problem of fault diagnosis in discrete event systems
has been extensively studied. According to (Sampath, Sengupta,
Lafortune, Sinnamohideen, & Teneketzis, 1996), the fault diagnosis
problem can be described in terms of a description of a plant's
behavior in the form of a finite automaton. Any behavior of the
plant can be represented as a run of this automaton, i.e. a
sequence of events. These events can be either observable or
unobservable. If the fault event is observable, then the diagnosis
problem is trivial. However, usually one or more unobservable
events correspond to the occurrence of a fault in the plant
operation. The objective is to find a diagnoser that can detect the
occurrence of a fault event within a bounded number of steps from
the occurrence. In our setting, however, we also need to consider
the possibility of timed failure propagation and faults in the
controllers as well as the plant.

Our approach can improve the effectiveness of isolating failures in
large-scale systems such as Smart Electric Grids by identifying
impending failure propagations and determining the time to critical
failure, which increases the system reliability and reduces the
losses accrued due to power failures.

3. Temporal Causal Diagrams

A Temporal Causal Diagram is a behavior-augmented temporal failure
propagation graph model. The TCD model of a component can describe
the fault propagation and/or the behavior. The failure propagation
is described in terms of Timed Failure Propagation Graphs (TFPG)1.
The component behavior under nominal and faulty conditions is
captured through Timed Discrete Event Systems (TDES). A TDES is
characterized as follows:

• Q: The set of discrete states of the component.

• F: The set of failure modes internal to the component. As
always, failure modes are not directly observable.

• D: The set of discrepancies, i.e. potentially observable
anomalies, if any, associated with the component behavior. A
discrepancy can be detected or triggered by the component, or can
affect the component behavior.

• E: The set of events that correspond to controller commands,
actuation, external mode commands, detection of the physical state
of the component, discrepancy detection, or other internal events.
The detection of a discrepancy, d, is written as d↑, while d↓
denotes the remission of a discrepancy.

• A mode map, M : Q → 2^M, captures the effect of a state in Q on
the TFPG mode in M. Thus, the discrete state of the system affects
the current modes of the TFPG, which in turn affect the propagation
links.

• δ is the transition map. The transitions are written as
[Guard]Event(delay)/Actions. The Guard condition can represent the
presence of a local fault f ∈ F, written as in(f), and the absence
of it, written as !in(f). Note that for brevity, unless
specifically required, we will use the shorthand f and !f in the
guard conditions. Actions result in the production of events that
can be communicated to the rest of the system, and/or change the
mode of the system. The delay, if present, declares that the
transition will occur after the timeout. The rising edge of an
event is described by appending the up arrow ↑ to the event. The
falling edge of an event is shown using the down arrow ↓.

Figure 1. A TCD model of a system consists of interacting
subsystems containing components, where each component consists of
an interacting TFPG and TDES model.

1 See Appendix A for an overview of TFPG.
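To make the transition notation concrete, the following is a minimal Python sketch that parses a label of the form [Guard]Event(delay)/Actions and evaluates its guard against a set of active faults. The ASCII encoding used here is an assumption made for illustration and does not come from the paper: `D3_up` and `D3_down` stand for D3↑ and D3↓, guard literals are joined with `&`, actions are separated by commas, and the names `parse_label` and `guard_holds` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional, Set, Tuple

# Assumed ASCII form of [Guard]Event(delay)/Actions; every part is optional.
LABEL = re.compile(
    r"^(?:\[(?P<guard>[^\]]*)\])?"   # optional [Guard]
    r"(?P<event>[^(/]*)"             # Event (may be empty)
    r"(?:\((?P<delay>[^)]*)\))?"     # optional (delay)
    r"(?:/(?P<actions>.*))?$"        # optional /Actions
)


@dataclass(frozen=True)
class TransitionLabel:
    guard: Tuple[str, ...]      # literals such as "F3" or "!F3"
    event: Optional[str]        # triggering event, None if guard-only
    delay: Optional[float]      # timeout, None if absent
    actions: Tuple[str, ...]    # events produced by the transition


def parse_label(text: str) -> TransitionLabel:
    """Split a transition label into guard, event, delay, and actions."""
    match = LABEL.match(text.replace(" ", ""))
    if match is None:
        raise ValueError(f"malformed transition label: {text!r}")
    guard = tuple(g for g in (match.group("guard") or "").split("&") if g)
    event = match.group("event") or None
    delay = float(match.group("delay")) if match.group("delay") else None
    actions = tuple(a for a in (match.group("actions") or "").split(",") if a)
    return TransitionLabel(guard, event, delay, actions)


def guard_holds(label: TransitionLabel, active_faults: Set[str]) -> bool:
    """Shorthand f means the fault is present, !f means it is absent."""
    for literal in label.guard:
        if literal.startswith("!"):
            if literal[1:] in active_faults:
                return False
        elif literal not in active_faults:
            return False
    return True


nominal_trip = parse_label("[!F3]D3_up/C1")
```

Here `nominal_trip` carries the guard `("!F3",)`, the event `"D3_up"`, and the action `("C1",)`; `guard_holds(nominal_trip, set())` is true, while `guard_holds(nominal_trip, {"F3"})` is false.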


Figure 1 provides an overview of the TCD model of a system. The TCD
model is hierarchical: a system model is composed of subsystem
models, which in themselves are composed of component models. The
component model includes TFPG and/or TDES models. The TCD model
captures the interactions between the TFPG and TDES models both
within the component and across component boundaries. The
interactions between the TFPG and TDES models are captured
implicitly through the state changes in the modeling elements
common to the two models: failure modes, discrepancies, and modes.
The behavioral model can be designed to consume and react to the
updates of these common elements in the form of events (appearance,
disappearance, change) and conditions (presence, absence).
Likewise, the behavioral model can be designed to update these
common elements, which can then be consumed by the failure
propagation model. The cascading failure propagation effects across
component boundaries are captured explicitly (as in TFPG) through
failure propagation links between the discrepancy elements in each
component. Interactions between the behavior models are based on
the event generation and consumption paradigm. A TDES component can
consume events corresponding to commands, detection, and mode
changes generated by one or more component TDES models. It can also
generate similar events to be consumed by other component TDES
models.

Example 1 An illustrative TCD model is shown in Figure 2. The
failure modes (F1, F2, F3) are shown as rectangular blocks and the
discrepancies (D1, D2, D3, D4, D5, D6) as circular elements. The
fault propagation across the TFPG model is captured by the edges
between the faults and the discrepancies. The markers (M1, M2) on
the edges capture the mode in which the fault could propagate via
the edge. Edges that do not carry any mode marker are always
enabled, implying that the faults can propagate in any mode (M1 or
M2) across these edges.

The dotted box captures a behavioral TDES model of a protection
element. It captures three operational states: S1, S2, and S3. S1
is the initial state, which maps to system mode M1. The protection
element transitions from state S1 to S2 when it detects the
presence of a discrepancy D3 and the fault F3 is not present (guard
condition: !F3 & D3↑) and issues a command (event) C1. The state
transition results in a mode change to M2. This nominal operation
of the protection element arrests the propagation of the failure
effect due to fault F2, thereby preventing the anomalies related to
discrepancies D4, D5, D6 from triggering in the system. However, it
could happen that the anomaly related to discrepancy D1 is observed
in the system.

Also, the TDES model shows that when the protection element detects
the absence of the discrepancy D3 (transition: D3↓), it issues a
command C2 (event) and transitions back to the state S1 (and
restores the system mode back to M1). If the fault F2 were to
reappear and trigger discrepancy D3, the protection element would
react again to arrest the fault propagation.

Fault F3 captures an internal fault in the protection element with
regard to detecting the presence of D3. The TDES model captures
this as the protection element transitioning into state S3. When
the fault F3 disappears, the protection element is automatically
restored to the nominal state S1. However, when in S3 the
protection element cannot react to the presence of the discrepancy
D3 and hence cannot arrest the fault propagation, leading to the
triggering of anomalies related to discrepancies D4, D5, and D6.
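The protection element of Example 1 can be sketched as a small state machine. This is an illustrative reading of the text, not the authors' implementation: the event names `D3_up` and `D3_down` stand for D3↑ and D3↓, the mapping of S3 to mode M1 is an assumption (the text only says the element cannot react while in S3), and the behavior when F3 appears while the element is in S2 is not described in the text, so it is left unhandled here.

```python
# Assumed mode map: S1 -> M1 and S2 -> M2 follow the text; S3 -> M1 is an
# assumption, since a failed protection element leaves the mode unchanged.
MODE_OF_STATE = {"S1": "M1", "S2": "M2", "S3": "M1"}


class ProtectionElement:
    """Hypothetical sketch of the TDES model described in Example 1."""

    def __init__(self):
        self.state = "S1"     # initial state
        self.f3 = False       # internal detection fault F3

    @property
    def mode(self):
        return MODE_OF_STATE[self.state]

    def set_fault(self, present):
        """F3 appearing in S1 moves to S3; F3 disappearing restores S1."""
        self.f3 = present
        if present and self.state == "S1":
            self.state = "S3"
        elif not present and self.state == "S3":
            self.state = "S1"

    def on_event(self, event):
        """Consume a discrepancy event and return the commands issued."""
        if self.state == "S1" and event == "D3_up" and not self.f3:
            self.state = "S2"          # [!F3]D3_up / C1, mode becomes M2
            return ["C1"]
        if self.state == "S2" and event == "D3_down":
            self.state = "S1"          # D3_down / C2, mode restored to M1
            return ["C2"]
        return []                      # no observable reaction


relay = ProtectionElement()
trace = [relay.on_event("D3_up"), relay.mode,
         relay.on_event("D3_down"), relay.mode]
```

With F3 absent, the trace records the command C1 and mode M2, followed by C2 and mode M1. Calling `set_fault(True)` before `on_event("D3_up")` yields no command, which is the silent misoperation the reasoner has to infer from downstream discrepancies.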


3.1.  Event  Propagation  Paths  from  the  Behavioral  Model 

The TDES models in TCD are used to generate event propagation
paths. An event propagation path is generated for each transition
and state when the transition parameters (trigger, guard, action)
or state parameters (entry/exit/during actions) include event
variables that belong to any of the following categories: failure
mode, discrepancy, or observable events (detection, command, and
actuation). When these variables are present in the event and/or
guard condition, they are treated as (causal) source nodes of the
event propagation path. When they are present in the transition
actions and state actions (entry/during), they are treated as the
destination (effect) nodes. The modes appear as source
(destination) nodes if they are mapped to the source (destination)
state in the TDES model. Additional nodes in the event propagation
path include composition nodes (AND and OR) that relate or combine
the cause(s) (source nodes) and effect(s) (destination nodes), as
well as NOT nodes that are used to mark the absence or
disappearance of faults (i.e. failure modes). Multiple event
propagation paths can be chained together by tracing the
state-transition model in the TDES and ignoring the internal,
unobservable states and events.

Example 2 Event propagation paths for the protection element TDES
model in Figure 2 are

(a) M1, !F3, D3↑ → C1, M2,
(b) M2, D3↓ → C2, M1, and
(c) M1, F3 → ∅ (NoObs).
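The derivation of these paths from the transitions can be sketched in Python as follows. The `Transition` tuple, the `propagation_path` helper, and the ASCII event names (`D3_up`, `D3_down`) are hypothetical names introduced for illustration. One rule is an assumption chosen so that the output matches path (c): the destination mode is listed only when it differs from the source mode, and a path with no destination nodes is reported as `NoObs`.

```python
from typing import Dict, List, NamedTuple, Tuple


class Transition(NamedTuple):
    source: str                 # source state
    target: str                 # destination state
    guard: Tuple[str, ...]      # fault literals, e.g. ("!F3",)
    event: str                  # triggering event, "" if none
    actions: Tuple[str, ...]    # commands or events produced


def propagation_path(tr: Transition,
                     mode_of: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (source nodes, destination nodes) for one transition."""
    # Causes: the mode of the source state, the guard literals, the trigger.
    sources = [mode_of[tr.source], *tr.guard]
    if tr.event:
        sources.append(tr.event)
    # Effects: the actions, plus the new mode if the mode changes (assumed).
    destinations = list(tr.actions)
    if mode_of[tr.target] != mode_of[tr.source]:
        destinations.append(mode_of[tr.target])
    return sources, destinations or ["NoObs"]


# Protection element of Figure 2 as described in the text (S3 -> M1 assumed).
mode_of = {"S1": "M1", "S2": "M2", "S3": "M1"}
transitions = [
    Transition("S1", "S2", ("!F3",), "D3_up", ("C1",)),
    Transition("S2", "S1", (), "D3_down", ("C2",)),
    Transition("S1", "S3", ("F3",), "", ()),
]
paths = [propagation_path(tr, mode_of) for tr in transitions]
```

The three entries of `paths` correspond to (a), (b), and (c) above. The AND/OR composition nodes and the chaining of paths across unobservable states are not modeled in this sketch.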

3.2.  Reasoning  using  TCD 

The  TCD  reasoning  algorithm  relies  on  the  fault  propagation 
model  (TFPG)  and  the  event  propagation  models  (generated 
from  the  TDES)  to  hypothesize  the  possible  causes  for  the 
anomalies  and  event  traces  observed  in  the  system.  The  al¬ 
gorithm  tries  to  explain  the  observations  in  terms  of  a  consis¬ 
tency  relationship  between  the  states  of  the  nodes  and  edges 
in  the  fault  propagation  and  event  propagation  model. 

The TCD reasoning algorithm considers the physical, observed, and
hypothetical states of the nodes and edges in the fault propagation
and event propagation model. A physical state corresponds to the
current state of the set (V) of all the nodes and edges. At any
time t, the physical state of the nodes and edges is given by a map
AS_t : V → {ON, OFF} × M. An ON state for a fault node indicates
that the failure is present; otherwise it is set to OFF. For a
discrepancy node, an ON state indicates that the failure (effect)
has reached this node; otherwise it is set to OFF. An ON state for
a failure propagation edge indicates that the edge can carry the
failure (effect) from the parent to the child node; otherwise it is
set to OFF. For the non-failure nodes from the event propagation
models, an ON state indicates that the associated event variable or
mode variable is set to the state represented by that node;
otherwise the state is OFF.

The observed state at time t is defined as a map S_t : V →
{ON, OFF} × M, for all the observable nodes in the fault and event
propagation model. The aim of the TCD reasoning process is to find
a consistent and plausible explanation of the current system
physical state based on the observed state. Such an explanation is
given in the form of a valid hypothetical state. A hypothetical
state is a map that defines the states of the nodes (and edges) and
the interval at which each node (and edge) changes its state.
Formally, a hypothetical state at time t is a map
H^{V'}_t : V' → {ON, OFF, UNKNOWN} × I × I, where V' ⊆ V.

Algorithm 1 TCD Reasoner Update

 1: INPUTS: t, HS_{t-1}, O_t
 2: HS_t ← UpdateHypo(t, HS_{t-1})
 3: if O_t ≠ ∅ then
 4:   HS'_t = HS_t
 5:   HS_t = ∅
 6:   for all H ∈ HS'_t do
 7:     if Consis(H, O_t) then
 8:       HS_t ← HS_t ∪ {H}
 9:     end if
10:   end for
11:   if HS_t = ∅ then
12:     for all H ∈ HS'_t do
13:       HS_t ← HS_t ∪ ExplainHypo(H, O_t)
14:     end for
15:   end if
16: end if
17: return HS_t

A reasoner hypothesis is an estimate of the current state of all
nodes in the system and the time period at which each node changed
its state. An estimate of the current state is valid only if it is
consistent with the TCD model. State consistency in the TCD model
is a node-parent relationship that can be extended pairwise to
arbitrary subsets of nodes. The TCD reasoner uses the consistency
relationships defined in (Abdelwahed et al., 2004; Abdelwahed,
Karsai, & Biswas, 2005) (between the TFPG nodes and edges) for all
the nodes and edges in the TCD model, i.e. it extends the
consistency relationship to the non-fault nodes in the event
propagation model as well. At any time, t, during the reasoning
process, the TCD reasoner uses Algorithm 1 to update the hypotheses
based on the current set of observations. Algorithm 1 uses extended
versions of the concepts and algorithms defined in (Abdelwahed et
al., 2004, 2005) to account for event propagation and consistency
in event nodes. The additional procedures invoked by the algorithm
are briefly described in Appendix A.

Inputs to the TCD Diagnosis Algorithm 1 include the current time,
t, the prior hypotheses set, HS_{t-1}, and the current alarm and
event observations, O_t. The diagnosis algorithm (1) returns a set
of hypotheses that can consistently explain the current observed
state of the TCD system. The algorithm starts by updating the
existing hypotheses (HS_{t-1}) to the current time, HS_t (line #2).
Then, it identifies the set of hypotheses that can consistently
explain the current alarm and event observations (lines #4-#10). In
case none of the hypotheses are consistent with the observations,
the algorithm generates new hypotheses from each of the old
hypotheses to explain the current observations (lines #11-#15).
Across each update, the TCD reasoner keeps a score of the number of
consistent, inconsistent, missing, and pending observations for
each hypothesis and generates metrics (described later) to identify
the best possible explanation, i.e. hypothesis.
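The update step described above can be sketched in Python with the three procedures passed in as functions. This is a minimal sketch, not the authors' reasoner: `update_hypo`, `consistent`, and `explain` are placeholders for UpdateHypo, Consis, and ExplainHypo, and the toy versions below (a hypothesis is a set of fault names, and `SIGNATURE` maps each fault to the alarms it explains) are assumptions made only so that the example runs. Timing, modes, and observation scoring are omitted.

```python
from typing import Callable, FrozenSet, Iterable, List, Set

Hypothesis = FrozenSet[str]   # toy hypothesis: the set of faults assumed active


def tcd_update(t: float,
               prior: Iterable[Hypothesis],
               observations: Set[str],
               update_hypo: Callable,
               consistent: Callable,
               explain: Callable) -> List[Hypothesis]:
    """One reasoner update following the structure of Algorithm 1."""
    current = list(update_hypo(t, prior))           # advance to time t
    if not observations:                            # nothing new to explain
        return current
    kept = [h for h in current if consistent(h, observations)]
    if kept:                                        # some hypotheses survive
        return kept
    extended: List[Hypothesis] = []                 # none survive: extend each
    for h in current:
        extended.extend(explain(h, observations))
    return extended


# Toy stand-ins (assumptions for illustration only).
SIGNATURE = {"F1": {"D1"}, "F2": {"D2", "D3"}}


def update_hypo(t, prior):
    return prior                                    # no time-based pruning


def consistent(h, obs):
    explained = set().union(*(SIGNATURE[f] for f in h))
    return obs <= explained                         # every alarm is explained


def explain(h, obs):
    return [h | {f} for f in SIGNATURE
            if f not in h and consistent(h | {f}, obs)]


step1 = tcd_update(1, [frozenset()], {"D1"}, update_hypo, consistent, explain)
step2 = tcd_update(2, step1, {"D1", "D2"}, update_hypo, consistent, explain)
```

In this toy run the empty hypothesis cannot explain D1, so it is extended to {F1}; when D2 is also observed, {F1} is no longer consistent and is extended to {F1, F2}.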

Hypotheses Ranking

The quality of the generated hypotheses is measured based on four
independent factors: (a) Plausibility is a measure of the degree to
which a given hypothesis group explains the current fault and event
signature. (b) Robustness is a measure of the degree to which a
given hypothesis is expected to remain constant. (c) #FM is a
measure of how many failure modes are listed by the hypothesis. The
reasoner applies the parsimony principle (minimal number of failure
modes) when reporting results. (d) Failure rate is a measure of how
often a particular failure mode will occur. In case of multiple
failures, the failure rates of the failure modes are combined
assuming independence.
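As a rough illustration of how these factors could be combined, the sketch below sorts hypotheses by plausibility, then robustness, then number of failure modes, then combined failure rate. The lexicographic ordering, the `Hypothesis` fields, the treatment of failure rates as probabilities, and the numeric values are all assumptions for illustration; the text states only that the failure rates of multiple failure modes are combined assuming independence, which is taken here to mean a product.

```python
from dataclasses import dataclass
from math import prod
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Hypothesis:
    failure_modes: Tuple[str, ...]   # failure modes listed by the hypothesis
    plausibility: float              # higher is better
    robustness: float                # higher is better


def combined_rate(h: Hypothesis, rates: Dict[str, float]) -> float:
    """Joint rate of independent failure modes, taken as a product."""
    return prod(rates[f] for f in h.failure_modes)


def rank(hypotheses: List[Hypothesis],
         rates: Dict[str, float]) -> List[Hypothesis]:
    """Best hypothesis first, under an assumed lexicographic ordering."""
    return sorted(
        hypotheses,
        key=lambda h: (-h.plausibility,          # explains more of the signature
                       -h.robustness,            # expected to remain stable
                       len(h.failure_modes),     # parsimony: fewer failure modes
                       -combined_rate(h, rates)) # more likely failures first
    )


rates = {"F2": 0.2, "F3": 0.1}                   # hypothetical values
single = Hypothesis(("F2",), plausibility=1.0, robustness=1.0)
double = Hypothesis(("F2", "F3"), plausibility=1.0, robustness=1.0)
ranked = rank([double, single], rates)
```

With equal plausibility and robustness, the single-fault hypothesis is ranked ahead of the double-fault one by parsimony.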

3.3.  Reasoner  improvements 

The improvements and updates in the TCD reasoning process over the
TFPG reasoner include: (a) Observation evolution, i.e. tolerating
the evolution or change in the observed state of the nodes. (b)
Internal mode changes, i.e. accounting for mode changes that are
not externally controlled but introduced by the dynamics of the
protection systems. The mode change could be unobservable, but
inferred based on other observations. (c) Fault negation, i.e.
accounting for the disappearance or absence of one or more faults
based on certain observations.

Handling  changes  in  the  observations 

In case of the TFPG reasoner, the observed state of a discrepancy
node is either considered latched or intermittent (due to the
nature of the fault or problems in the sensor). However, in TCD,
the dynamics of the protection system might prevent a certain
failure propagation and hence result in an apparently consistent
change to the observed state of an alarm (or discrepancy). It is
also possible that both the appearance and disappearance of a fault
can be accounted for when the observed state of the discrepancy is
allowed to change. More importantly, since the protection systems
are actively trying to arrest the failure effect propagation and
also respond to the disappearance of faults, it is possible that
the observed state of the non-fault event nodes could be updated
over time based on the behavioral model of the protection system.
If the events are observable, then the TCD reasoner updates the
hypothetical states to be consistent with the updated observed
state of the fault and non-fault nodes. In the TCD example shown in
Figure 2, it is possible that when the fault F2 happens, the
anomalies D4, D5, D6 could have triggered because the system was in
mode M1. However, once the protection system completes its
operation and the mode is changed to M2, the anomalies related to
D4, D5, D6 should not be observable or detectable (based on the
model). The TCD reasoner can account for this by changing the
hypothetical states of these nodes to UNKNOWN. Further, if the mode
is later restored to M1 when D3 disappears (D3↓), the reasoner can
account for the disappearance (or lack of observation) of D2, D4,
D5 and D6. This is done by applying the consistency relationship to
update the hypothetical state of fault F2 and discrepancies D2, D4,
D5, and D6 to OFF.

Figure 3. Segment of a Power Transmission System

Figure 4. Protection Zone Configuration for Distance Relay. Zone 1
is set to protect 80% of the entire length of the line, and
operates immediately (t^I) if the fault falls in the zone 1
protection region. Zone 2 is set to protect 100% of the entire line
length plus 50% of the adjacent line, and operates with a time
delay, t^II, of 15-30 cycles (0.5 s). Zone 3 is set to protect 100%
of the entire line length plus 100% of the adjacent line, and
operates with a time delay, t^III (1.5 s).

Mode  changes  introduced  by  protection  system 

The protection and control systems are actively involved in
changing the mode of the physical system to arrest the fault
propagation. The TCD reasoning algorithm accounts for this
by maintaining a hypothetical state for each mode. The
hypothetical state of the mode is updated based on other
observations and on the consistency relationship between the
hypothetical state of the mode and those of the other TCD nodes.
The reasoning algorithm updates the expected hypothetical states
of other nodes if the hypothetical state of the mode changes. In the
TCD example shown in Figure 2, the TCD reasoner updates
the hypothetical states based on the mode changes introduced
by the protection system. If the mode is changed to
M2 upon appearance of the fault F1, the updated hypothetical
state for D1 can consistently explain any observation of an
anomaly related to D1. If the protection system fault F3
is present, then the lack of any observation (NULL) from the
protection system, together with observations of discrepancies
D4, D5, D6, would suggest that the system is still in mode M1 and
that the protection system has failed to act because of fault F3.
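
The mode-dependent update of hypothetical states can be illustrated with a small sketch. It is not the TCD reasoner itself: the node names follow the Figure 2 example, while the `State` enum, the `OBSERVABLE_IN` map (which discrepancy can be observed in which mode) and the `apply_mode_change` helper are assumptions introduced here for illustration.

```python
from enum import Enum


class State(Enum):
    ON = "ON"
    OFF = "OFF"
    UNKNOWN = "UNKNOWN"


# Assumed for illustration: the modes in which each discrepancy is observable.
OBSERVABLE_IN = {
    "D3": {"M1", "M2"},
    "D4": {"M1"},
    "D5": {"M1"},
    "D6": {"M1"},
}


def apply_mode_change(hypothesis, new_mode, observable_in=OBSERVABLE_IN):
    """Return a copy of the hypothesis after the system mode changes.

    A discrepancy hypothesized ON that cannot be observed in the new mode
    is marked UNKNOWN, because the model no longer predicts an observation.
    """
    updated = dict(hypothesis)
    updated["mode"] = new_mode
    for node, modes in observable_in.items():
        if updated.get(node) is State.ON and new_mode not in modes:
            updated[node] = State.UNKNOWN
    return updated


hypo = {"mode": "M1", "D3": State.ON, "D4": State.ON,
        "D5": State.ON, "D6": State.OFF}
after = apply_mode_change(hypo, "M2")
print(after["D4"].value, after["D3"].value, after["D6"].value)
```

Nodes that were already OFF are left untouched, and the input hypothesis is not modified, so earlier hypotheses remain available for comparison.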

Fault  negation 

The TCD reasoning algorithm can generate hypotheses that
state that one or more faults are not present in the system.
This is possible if the TDES model (and hence the event
propagation model) includes specific conditions that state certain
events can happen only if the fault is not present. The event
propagation model accounts for the negated fault, and updates
the hypothesis appropriately if the concerned events are
observed. In the TCD example shown in Figure 2, the triggering
of command C1 by the protection system indicates, among
other things, the absence of fault F3. Also, the triggering of
command C2 indicates the disappearance of D3 (!D3) and
hence the negation or disappearance of the fault F2.

Table 1. Fault Propagation: The faults in the transmission lines are categorized based on the segment where they occur along
the length (L) of the line (from left to right) - F_20: [0, 0.2L), F_50: [0.2L, 0.5L), F_80: [0.5L, 0.8L), F_100: [0.8L, 1.0L),
where L is the length of the transmission line. Each row in the table should be read as described for the first row: a fault F_20 in
transmission line TL1 will lead to a zone 1 fault (d_z1) in DR1, a zone 2 fault (d_z2) in DR2 and a zone 3 fault (d_z3) in DR4.

Source Node                          Destination Node                  Mode
(Transmission Line.Failure Mode)     (Relay.zone)
---------------------------------------------------------------------------
TL1.F_20                             DR1.d_z1, DR2.d_z2, DR4.d_z3      M_Close
TL1.F_50                             DR1.d_z1, DR2.d_z1, DR4.d_z3      M_Close
TL1.F_80                             DR1.d_z1, DR2.d_z1, DR4.d_z2      M_Close
TL1.F_100                            DR1.d_z2, DR2.d_z1, DR4.d_z2      M_Close
TL2.F_20                             DR1.d_z2, DR3.d_z1, DR4.d_z2      M_Close
TL2.F_50                             DR1.d_z2, DR3.d_z1, DR4.d_z1      M_Close
TL2.F_80                             DR1.d_z3, DR3.d_z1, DR4.d_z1      M_Close
TL2.F_100                            DR1.d_z3, DR3.d_z2, DR4.d_z1      M_Close

4.  Example 

The example system considered in this paper (Figure 3) is a
segment of a power transmission system. Power system
components such as buses, lines, and transformers are protected by
relays and breakers. When a fault occurs, relays and breakers
are designed to isolate the fault according to a pre-determined
protection scheme. Additionally, the system includes back-up
relays to account for any problems in the primary relays
and breakers. The system in Figure 3 is part of a network
and includes three substations (SS1, SS2, and SS3) and two
transmission lines (TL1, TL2). Transmission line TL1 carries
power between buses BU1 and BU2, while transmission line
TL2 is between buses BU2 and BU3. Each transmission line
is protected with a distance relay and breaker at its two ends.

The distance relays estimate impedance using the voltage and
current measurements at the relay measurement point. The
estimated impedance is compared with the reach point impedance.
If the estimated impedance is less than the reach point impedance,
it is assumed that a fault exists on the line between the relay
and the reach point. The fault zone (zone1, zone2, zone3)
is determined based on the estimated impedance. Figure 4
shows the region corresponding to each protection zone
relative to relay DR1 and the relative time-scales for the relay
operation in each zone. A distance relay has to perform the
dual task of primary and back-up protection depending on the
fault zone. For faults in zone1 (80% of the entire length of
the transmission line, L1), it serves as the primary protection
and acts fast without any intentional time delay ($t_{a1}$
= 5 to 6 cycles). For faults in zone2 (up to 50% of the
adjacent line) and zone3 (up to 100% of the adjacent line), the
relay serves as a back-up and reacts with some time delay,
allowing the primary relay to operate. In zone2, the time
delay ($t_{a2}$) is approximately 15-30 cycles (0.5 sec), while in
zone3 it acts with a delay ($t_{a3}$) of about 1.5 sec. Additionally,
to account for temporary faults in the transmission lines,
the relays include a fast and a delayed auto-reclosure function,
wherein they check for the fault after 2 sec (fast reclosure)
and after 2-3 minutes (delayed reclosure). If the fault
persists, the relay disconnects the circuit permanently until it
is remotely commanded to reset.
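
The zone decision described above can be sketched in a few lines. This is a simplified illustration rather than a relay implementation: it compares impedance magnitudes only (practical relays use a characteristic in the complex impedance plane), and the function name `classify_zone`, its parameters, and the use of 0.0 s for the zone1 delay are assumptions made here.

```python
# Intentional operating delay per zone, in seconds. Zone 1 has no
# intentional delay (taken as 0.0 here); zones 2 and 3 use 0.5 s and 1.5 s.
ZONE_DELAY_S = {1: 0.0, 2: 0.5, 3: 1.5}


def classify_zone(v_phasor, i_phasor, z_line, z_adjacent):
    """Return (zone, delay_s) for a voltage/current measurement, or None.

    z_line and z_adjacent are the impedance magnitudes (in ohms) of the
    protected line and of the adjacent line.
    """
    z_est = abs(v_phasor / i_phasor)          # estimated impedance magnitude
    reach = {
        1: 0.8 * z_line,                      # 80% of the protected line
        2: z_line + 0.5 * z_adjacent,         # line + 50% of adjacent line
        3: z_line + 1.0 * z_adjacent,         # line + 100% of adjacent line
    }
    for zone in (1, 2, 3):
        if z_est < reach[zone]:
            return zone, ZONE_DELAY_S[zone]
    return None                               # beyond the zone 3 reach


# Two 10-ohm lines; the measured current is 10 A in each case.
for volts in (50, 120, 180, 250):
    print(volts, classify_zone(complex(volts, 0), complex(10, 0), 10.0, 10.0))
```

Checking the zones in increasing order returns the innermost zone that contains the estimated impedance, which is the zone that determines the relay's delay.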

Each substation has a remote terminal unit (RTU) as part of
the SCADA system to send the breaker status and other
measurements to the control center's Energy Management System
(EMS). Some of the details recorded by the Sequence Event
Recorder (SER) at each substation include: (a) zone information
and start protection time (in case of zone 1), (b) tripping
command sent by the relay to the breaker, (c) breaker status: opened
or closed, (d) phase discordance problem: when the breaker tried
to open three phases but did not succeed for all three phases,
(e) reclosure command issued by the relay to reclose the breaker, and
(f) reclosure blocked command issued by the relay to reset the breaker
to open after a failed reclosure.

4.1.  TCD  model 

The TCD model of the system in Figure 3 includes (a) the fault
propagation model for transmission line faults, (b) the breaker
behavioral model, and (c) the distance relay behavioral model.

Fault Propagation Model: Table 1 captures the propagation
of the faults in the transmission lines (TL1, TL2) to the
discrepancies in distance relays (DR1, DR2, DR3, DR4). The
faults in the transmission lines are categorized based on the
segment where they occur along the length (L) of the line
(from left to right) - F_20: [0, 0.2L), F_50: [0.2L, 0.5L), F_80:
[0.5L, 0.8L), F_100: [0.8L, 1.0L), where L is the length of
the transmission line. Discrepancies correspond to the zone
with respect to the relay - d_z1: zone1, d_z2: zone2, d_z3:
zone3. All failure propagations are active in mode M_Close,
when the circuit is closed.
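
The entries of Table 1 follow from the zone reach settings and the position of each relay. The sketch below reproduces them under stated assumptions that are not spelled out in the text: both lines have the same length L, the buses sit at 0 (BU1), 1 (BU2) and 2 (BU3) in units of L, DR1 and DR3 look from left to right, DR2 and DR4 look from right to left, and each fault segment is represented by its midpoint. All names are hypothetical.

```python
# Relay position (in units of L) and look direction (+1: rightwards).
RELAYS = {"DR1": (0.0, +1), "DR2": (1.0, -1), "DR3": (1.0, +1), "DR4": (2.0, -1)}
LINE_START = {"TL1": 0.0, "TL2": 1.0}
SEGMENTS = {"F_20": (0.0, 0.2), "F_50": (0.2, 0.5),
            "F_80": (0.5, 0.8), "F_100": (0.8, 1.0)}


def zone_seen(relay, x):
    """Zone in which the relay sees a fault at position x, or None."""
    position, direction = RELAYS[relay]
    distance = (x - position) * direction
    if distance < 0:
        return None            # fault lies behind the relay
    if distance < 0.8:
        return "d_z1"          # 80% of the protected line
    if distance < 1.5:
        return "d_z2"          # line + 50% of the adjacent line
    if distance <= 2.0:
        return "d_z3"          # line + 100% of the adjacent line
    return None


def propagation_row(line, segment):
    """Discrepancies raised by a fault in the given line segment."""
    low, high = SEGMENTS[segment]
    x = LINE_START[line] + (low + high) / 2.0   # midpoint of the segment
    row = {}
    for relay in RELAYS:
        zone = zone_seen(relay, x)
        if zone is not None:
            row[relay] = zone
    return row


for line in ("TL1", "TL2"):
    for segment in SEGMENTS:
        print(f"{line}.{segment}", propagation_row(line, segment))
```

Under these assumptions the relay that looks away from the faulted line (DR3 for TL1 faults, DR2 for TL2 faults) raises no discrepancy, which matches the three destination nodes per row in Table 1.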

Breaker Behavioral Model: The breaker behavioral model
(Table 2) includes states Open, Close, and partially open (P_Open). The
Open state maps to the system mode M_Open; states Close
and P_Open (partially open) map to the mode M_Close. The
breaker receives commands from its distance relay to open
(C_Open) and close (C_Close). After executing the command,
it reports the physical state of the breaker as St_Open (for
open) and St_Close (for close). The behavioral model includes
breaker faults related to being stuck open (F_st_open), stuck
close (F_st_close) and partially open (F_part). Table 2 shows
the operation of the breaker in terms of the transitions
between the states based on the events (commands) and fault
conditions. Rows 1-2 capture the nominal operation to close
and open the breaker when it receives the appropriate
command. While rows 3-4 capture the breaker behavior when it
is stuck close, rows 5-6 deal with a breaker with a stuck open
fault. Rows 7-11 deal with a partially open breaker (which
leads to phase discordance problems in the system).

Table 2. Transitions in a breaker's behavior model. The model includes states Open, Close and partially open (P_Open). Close
is the initial state. Rows 1-2 capture the nominal operation to close and open the breaker. Rows 3-11 deal with faulty operation
- rows 3-4: stuck close fault, rows 5-6: stuck open fault, rows 7-11: partially open fault.

#    Src. State   Dst. State   Trigger      Guard                      Action
-------------------------------------------------------------------------------
1    Open         Close        C_Close      !F_st_open & !F_part       St_Close
2    Close        Open         C_Open       !F_st_close & !F_part      St_Open
3    Open         Close        F_st_close   none                       none
4    Close        Close        C_Open       F_st_close                 St_Close
5    Close        Open         F_st_open    none                       none
6    Open         Open         C_Close      F_st_open                  St_Open
7    Open         P_Open       F_part       none                       none
8    P_Open       Open         !F_part      none                       none
9    Close        P_Open       C_Open       F_part                     St_Open
10   P_Open       P_Open       C_Open       F_part                     St_Open
11   P_Open       Close        C_Close      none                       St_Close

Table 3. Transition Information for Distance Relay's behavioral model. Rows 1-7 deal with the anomaly detection in state Det
(rows 1-3: Zone1, rows 4,5: Zone2, rows 6,7: Zone3). Rows 8,9 deal with wait (until timeout) operation in Wait state based on
the wait time Tw set for different operations - fast reclosure (TFR), delayed reclosure (TDR), backup in zone2 (Tz2) and zone3
(Tz3). Rows 10-12 deal with system mode conditions for anomaly detection (transition to state Det). Rows 13-16 handle resets.
Rows 17-21 deal with anomaly detection fault (F_de).

#    Src State   Dst State   Trigger                           Guard             Action
------------------------------------------------------------------------------------------------------
1    Det         Wait        d_z1↑                             n=0               Z1, C_Open, n=1, Tw=TFR
2    Det         Wait        d_z1↑                             n=1               C_Open, FRBLK, n=2, Tw=TDR
3    Det         BLK         d_z1↑                             n=2               C_Open, DRBLK
4    Det         Wait        d_z2↑                             n=0               n=3, Tw=Tz2
5    Det         BLK         d_z2↑                             n=3               C_Open
6    Det         Wait        d_z3↑                             n=0               n=4, Tw=Tz3
7    Det         BLK         d_z3↑                             n=4               C_Open
8    Wait        Ch_det      Timeout(Tw)                       n <= 2            C_Close
9    Wait        Ch_det      Timeout(Tw)                       n > 2             none
10   Ch_det      Det         none                              M_Close & !F_de   none
11   Ch_det      No_Det      none                              M_Open            none
12   No_Det      Det         none                              M_Close           none
13   No_Det      Reset       C_Reset                           none              none
14   BLK         Reset       C_Reset                           none              C_Close
15   Det         Reset       d_z1↓ & d_z2↓ & d_z3↓ & n>0       none              none
16   Reset       Ch_det      none                              none              n=0
17   Ch_det      Det_Err     F_de                              none              none
18   Det_Err     Ch_det      !F_de                             none              none
19   Det         Det_Err     F_de                              none              none

Event propagation paths related to the transitions listed in
Table 2 capture the pre (source) and post (destination)
conditions and observations to help analyze whether the breaker is
operating nominally or is faulty. The generated event
propagation paths are as follows:

(a) M_Close, C_Open, !F_st_close, !F_part → St_Open, M_Open

(b) M_Open, C_Close, !F_st_open, !F_part → St_Close, M_Close

(c) M_Open, C_Close, F_st_open → St_Open, M_Open

(d) M_Close, C_Open, F_st_close → St_Close, M_Close

(e) M_Close, C_Open, F_part → St_Open, M_Close

(f) M_Close, C_Close, F_part → St_Close, M_Close
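
As an illustration, the command-triggered rows of Table 2 can be encoded as a small guarded transition table. This is a sketch, not the authors' TDES implementation: the `step` function, the representation of guards as functions over the set of active faults, and the `MODE_OF` map are choices made here. The fault-triggered rows (3, 5, 7 and 8) are omitted for brevity.

```python
# (source state, trigger, guard over active faults, destination, action)
TRANSITIONS = [
    ("Open", "C_Close",
     lambda f: "F_st_open" not in f and "F_part" not in f, "Close", "St_Close"),   # row 1
    ("Close", "C_Open",
     lambda f: "F_st_close" not in f and "F_part" not in f, "Open", "St_Open"),    # row 2
    ("Close", "C_Open", lambda f: "F_st_close" in f, "Close", "St_Close"),         # row 4
    ("Open", "C_Close", lambda f: "F_st_open" in f, "Open", "St_Open"),            # row 6
    ("Close", "C_Open", lambda f: "F_part" in f, "P_Open", "St_Open"),             # row 9
    ("P_Open", "C_Open", lambda f: "F_part" in f, "P_Open", "St_Open"),            # row 10
    ("P_Open", "C_Close", lambda f: True, "Close", "St_Close"),                    # row 11
]

# System mode implied by each breaker state.
MODE_OF = {"Open": "M_Open", "Close": "M_Close", "P_Open": "M_Close"}


def step(state, command, faults):
    """Apply one command; return (next_state, reported_status)."""
    for src, trigger, guard, dst, action in TRANSITIONS:
        if src == state and trigger == command and guard(faults):
            return dst, action
    return state, None          # no enabled transition: stay put, report nothing


print(step("Close", "C_Open", set()))             # nominal opening
print(step("Close", "C_Open", {"F_st_close"}))    # stuck-close fault
print(step("Close", "C_Open", {"F_part"}))        # partially open fault
```

In the partially open case the breaker reports St_Open while `MODE_OF["P_Open"]` remains M_Close, which is the mismatch captured by event propagation path (e).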

Distance Relay: The behavioral model states include: (a) Det:
the state in which the relay is actively looking for anomalies and
triggering the appropriate action upon detection, (b) Wait: when it is
waiting for a time-out to expire before taking the next set of
actions, (c) BLK: when it is blocking and waiting for a reset
command, as it has taken the necessary action to arrest
the fault propagation, (d) Det_Err: when it is unable to detect
anomalies because of an internal fault (F_de), and (e) other
miscellaneous states such as Ch_det (where it checks if detection
is feasible), No_Det (when no detection is possible), and Reset
(when it is resetting).

Table 4. Scenario 1: Distance Relays - Events and Hypotheses

Time(s)   Comp        Event            Hypotheses
-------------------------------------------------------------------
100.02    DR3, DR4    Z1, C_Open       H1_DR3 = d_z1, M:1/1
                                       H1_DR4 = d_z1, M:1/1
          DR1         Z2               H1_DR1 = d_z2
                                       H1_sys = TL2.F_20, M:2/3
                                       H2_sys = TL2.F_50, M:3/3
                                       H3_sys = TL2.F_80, M:2/3
                                       H4_sys = TL2.F_100, M:1/3
102.04    DR3, DR4    C_Close
102.07    DR3, DR4    FRBLK, C_Open    H2_sys = TL2.F_50, M:5/5
222.09    DR3, DR4    C_Close
222.12    DR3, DR4    DRBLK, C_Open    H2_sys = TL2.F_50, M:7/7

Table 5. Scenario 1: Breakers - Events & Hypotheses

Time(s)                     Comp        Event               Hypotheses
------------------------------------------------------------------------------------
100.03 / 102.08 / 202.13    BR3, BR4    C_Open, St_Open     H1_BR3 = C_Open, M_Open
                                                            H1_BR4 = C_Open, M_Open
102.05 / 222.10             BR3, BR4    C_Close, St_Close   H2_BR3 = C_Close, M_Close
                                                            H2_BR4 = C_Close, M_Close

Table 6. Event trace and Hypotheses: Scenario 2

Time(s)   Comp        Event            Hypotheses
-------------------------------------------------------------------------
100.02    DR3, DR4    Z1, C_Open       H1_DR3 = d_z1, M:1/1
                                       H1_DR4 = d_z1, M:1/1
          DR1         Z2               H1_DR1 = d_z2
                                       H1_sys = TL2.F_20, M:2/3
                                       H2_sys = TL2.F_50, M:3/3
                                       H3_sys = TL2.F_80, M:2/3
                                       H4_sys = TL2.F_100, M:1/3
102.07    DR3, DR4    NULL (No Obs)    H1_DR3 = d_z1, M:1/2
                                       H1_DR4 = d_z1, M:1/2
                                       H2_DR3 = d_z1↓, d_z2↓, d_z3↓, M:1/1
                                       H2_DR4 = d_z1↓, d_z2↓, d_z3↓, M:1/1
                                       H2_sys = TL2.F_50, M:3/5
                                       H3_sys = !TL2.F_50, M:2/2

The distance relay detects anomalies pertaining to faults in
Zone1 (d_z1), Zone2 (d_z2) and Zone3 (d_z3) of the
appropriate transmission line and reports these observations through
output events Z1 (Zone1), Z2 (Zone2) and Z3 (Zone3)
respectively. It issues commands to the breaker to open (C_Open)
and close (C_Close) and acts upon a command to reset (C_Reset).
It reports unsuccessful fast and delayed re-closure through
the output events FRBLK and DRBLK respectively. The
faults considered as part of the distance relay include
failure to detect the anomalies in transmission line impedance
(F_de). While the distance relay states do not map to any
system modes, the system modes determine whether the distance
relay is capable of detecting anomalies (mode: M_Close) or
not (mode: M_Open).

Table 3 describes the transitions for the distance relay's
behavioral model. Rows 1-3 deal with the nominal
operation when a discrepancy related to a zone1 fault is detected (row
2: fast re-closure, row 3: delayed re-closure). Rows 4,5 deal
with a zone2 fault and rows 6,7 with a zone3 fault. The wait times
(Tw) in the Wait state are set for fast reclosure (TFR),
delayed reclosure (TDR), and the backup wait time for a zone2 fault (Tz2)
and a zone3 fault (Tz3). These wait times (Tw) are used in the
Timeout(Tw) operation in rows 8 and 9. Rows 10, 11, 12
specify the system modes in which the distance relay can
detect anomalies, i.e. transition to the Det state. Rows 13-16 deal
with resetting the distance relay. Rows 17-21 deal with the
presence or disappearance of the fault (F_de) related to problems in
detecting anomalies.
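
To make the zone1 reclosure logic of Table 3 concrete, the sketch below steps a relay through a persistent zone1 fault. It is a simplified reading of rows 1-3 and 8-10 only: the class `DistanceRelay` and its method names are hypothetical, timers are reduced to explicit `on_timeout` calls, and the passage through Ch_det is collapsed into a direct return to Det (assuming mode M_Close and no F_de).

```python
class DistanceRelay:
    """Zone1 handling of the relay model (rows 1-3, 8-10 of Table 3)."""

    def __init__(self):
        self.state = "Det"
        self.n = 0          # reclosure counter used in the guards
        self.tw = None      # wait time selected by the last action

    def on_d_z1(self):
        """Rising edge of a zone1 discrepancy while in Det."""
        if self.state != "Det":
            return []
        if self.n == 0:                                   # row 1
            self.state, self.n, self.tw = "Wait", 1, "TFR"
            return ["Z1", "C_Open"]
        if self.n == 1:                                   # row 2
            self.state, self.n, self.tw = "Wait", 2, "TDR"
            return ["C_Open", "FRBLK"]
        if self.n == 2:                                   # row 3
            self.state, self.tw = "BLK", None
            return ["C_Open", "DRBLK"]
        return []

    def on_timeout(self):
        """Timeout(Tw) in Wait (rows 8-9), then Ch_det -> Det (row 10)."""
        if self.state != "Wait":
            return []
        actions = ["C_Close"] if self.n <= 2 else []
        self.state = "Det"   # assumes mode M_Close and no F_de
        return actions


relay = DistanceRelay()
trace = []
for _ in range(3):                 # the fault is re-detected after each reclosure
    trace.append(relay.on_d_z1())
    trace.append(relay.on_timeout())
print(trace, relay.state)
```

The resulting sequence (trip, reclose, trip with FRBLK, reclose, trip with DRBLK, then blocked) has the same shape as the DR3/DR4 event trace reported for the permanent-fault scenario.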

Event propagation paths related to the transitions listed in
Table 3 capture the pre (source) and post (destination)
conditions and observations to help analyze whether the distance
relay is operating nominally or is faulty. The generated event
propagation paths are as follows:

(a) M_Close, d_z1↑ → Z1, C_Open

(b) M_Close, d_z1↑ → FRBLK, C_Open

(c) M_Close, d_z1↑ → DRBLK, C_Open

(d) M_Close, d_z2↑ → Z2

(e) M_Close, d_z2↑ → C_Open

(f) M_Close, d_z3↑ → Z3

(g) M_Close, d_z3↑ → C_Open

(h) F_de → NULL (No Obs)

(i) d_z1↓ & d_z2↓ & d_z3↓ → NULL (No Obs)

Table 7. Event trace and Hypotheses: Scenario 3

Time(s)   Comp   Events           Hypotheses
-----------------------------------------------------------------------
100.02    DR4    Z1, C_Open       H1_DR4 = d_z1, M:1/1
          DR1    Z2               H1_DR1 = d_z2
                                  H1_sys = TL2.F_50, M:2/2
                                  H2_sys = TL2.F_80, M:1/2
                                  H3_sys = TL2.F_100, M:1/2
100.07    DR1    C_Open           H1_DR3 = F_de, M:1/1
                                  H4_sys = TL2.F_50, DR3.F_de, M:3/3
102.07    DR4    FRBLK, C_Open    H4_sys = TL2.F_50, DR3.F_de, M:4/4
222.12    DR4    DRBLK, C_Open    H4_sys = TL2.F_50, DR3.F_de, M:5/5

4.2.  Case  Study:  Fault  Scenarios  and  Diagnosis  Results 

This section considers a few fault scenarios in the example
power transmission system (Figure 3). The discrete
behavioral and fault propagation models described in
Section 4.1 are used to simulate the system in both the nominal
and faulty modes. The simulation is performed in Acumen
(Taha et al., 2012) with a simulation time-step of 0.01 sec.
The observable event traces are collected and analyzed based
on Algorithm 1. The reasoner uses the event propagation
paths described in Section 4.1 to reason about the events
observed in the breakers (BR1, BR2, BR3, BR4) and distance
relays (DR1, DR2, DR3, DR4). The fault propagation model
captured in Table 1 is used to produce system-wide consistent
hypotheses that can explain the observed anomalies and event
traces.

In all the scenarios described below, the system is
considered to be operating in nominal mode (mode = M_Close)
until time t = 100 sec, when transmission line TL2 experiences a
line-to-ground short fault, F_50.

Scenario  1:  Permanent  Fault  In  Transmission  Line 

In this scenario, the fault (TL2.F_50) is persistent. The
simulator-generated event traces (similar to data from Sequence
Event Recorders in a real system) are fed to the TCD reasoner.
Table 4 presents the events observed from the distance relays
(DR1, DR3, DR4) and the hypotheses generated by the TCD
reasoner. The initial hypotheses point towards a zone1
discrepancy (d_z1) in DR3 and DR4 and a zone2 discrepancy (d_z2)
in DR1. The system-level hypothesis H2_sys (fault: TL2.F_50)
has the maximum metric (3/3), with three consistent pieces of evidence
from DR1, DR3, and DR4. Moving forward, the observations of
failed reclosure - fast (FRBLK) and delayed (DRBLK) - from
DR3 and DR4 further support H2_sys (7/7), suggesting a
diagnosis of fault F_50 in TL2.
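
The initial ranking of system-level hypotheses can be reproduced with a short sketch. It assumes one plausible reading of the metric, namely the number of observed relay discrepancies that a candidate fault predicts over the number of observed discrepancies; the actual metric used by the reasoner may be defined differently. The dictionary copies the TL2 rows of Table 1, and the `rank` helper is hypothetical.

```python
# Discrepancies predicted by each TL2 fault (TL2 rows of Table 1).
TL2_ROWS = {
    "TL2.F_20": {"DR1": "d_z2", "DR3": "d_z1", "DR4": "d_z2"},
    "TL2.F_50": {"DR1": "d_z2", "DR3": "d_z1", "DR4": "d_z1"},
    "TL2.F_80": {"DR1": "d_z3", "DR3": "d_z1", "DR4": "d_z1"},
    "TL2.F_100": {"DR1": "d_z3", "DR3": "d_z2", "DR4": "d_z1"},
}


def rank(observed, rows=TL2_ROWS):
    """Score each candidate fault as (matched, total observed), best first."""
    scores = {}
    for fault, predicted in rows.items():
        matched = sum(
            1 for relay, zone in observed.items() if predicted.get(relay) == zone
        )
        scores[fault] = (matched, len(observed))
    return sorted(scores.items(), key=lambda item: item[1][0], reverse=True)


# Discrepancies hypothesized from the events observed at t = 100.02 s.
observed = {"DR3": "d_z1", "DR4": "d_z1", "DR1": "d_z2"}
ranked = rank(observed)
for fault, (matched, total) in ranked:
    print(f"{fault}: {matched}/{total}")
```

With these inputs TL2.F_50 scores 3/3, TL2.F_20 and TL2.F_80 score 2/3, and TL2.F_100 scores 1/3.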

The events generated from the breakers and their associated
hypotheses are presented in Table 5. The hypotheses suggest
nominal operation and capture the mode change. The multiple
time values in each row of column 1 correspond to the different
times at which the same events (and hypotheses) are observed.

Scenario 2: Temporary Fault In Transmission Line

Here, the fault (TL2.F_50) lasts for exactly 1 sec. DR3 and DR4 come
up to test the fast re-closure 2 sec after detecting a zone1
discrepancy (d_z1). Hypotheses H2_DR3 and H2_DR4 identify the
lack of any observations as consistent with the event
propagation path corresponding to the disappearance of
discrepancies (d_z1↓, d_z2↓, d_z3↓). Thereafter, the system hypothesis
H3_sys suggests, with 100% (2/2) supporting evidence, that
there is no fault in TL2 (!TL2.F_50).

Scenario 3: Fault In Transmission Line and Relay

This is a multi-fault scenario in which a distance relay fault, F_de,
prevents DR3 from detecting discrepancies produced by the
transmission line fault TL2.F_50. The lack of observations
consistent with the predicted hypothetical state of DR3.d_z1 suggests
problems with the event propagation path (M_Close, d_z1,
!F_de) in DR3. Hypothesis H1_DR3 in Table 7 explains this
observation (or lack thereof) with fault DR3.F_de. The multi-fault
system hypothesis (H4_sys) best explains the observations.

5.  Discussion  and  Conclusion 

We have presented in this paper a new formalism, Temporal
Causal Diagrams, with the objective of applying it to
diagnose cyber-physical systems that include local fast-acting
protection devices. Specifically, we have demonstrated the
capability of the TCD model to capture the discrete fault
propagation and behavioral model of a segment of a power
transmission system protected by distance relays and breakers.
Further, the paper presented the potential of the TCD-based
reasoner to diagnose faults in the physical system and its
protection elements.

As part of our future work, we wish to test and study the
scalability of this approach towards a larger power
transmission system including a far richer set of protection elements.
Further, we wish to consider more realistic event traces from
the fault scenarios, including missing, inconsistent, and
out-of-sequence alarms and events.
Nomenclature

t       arbitrary time instant
A_t     Alarms observed at time t
Ev_t    Events observed at time t
O_t     Observations (Alarms and Events) at time t
H       Hypothesis - a data structure that captures the
        hypothetical states of all the nodes in the model.
HS_t    Hypotheses set at time t.
HS_t    Temporary variable - hypotheses set.
↑       rising edge of an event. Also used to describe the
        onset of a discrepancy.
↓       falling edge of an event. If associated with a
        discrepancy, it describes the event associated with the
        remission of the discrepancy.

Figure 5. TFPG model (t = 10, Mode = A, $\forall t \in [0, 10]$).

Appendix 

A.  Timed  Failure  Propagation  Graph  (TFPG) 

A TFPG (Abdelwahed et al., 2004, 2005) is a labeled directed
graph. The root nodes are failure modes (fault causes). The
other nodes are discrepancies (off-nominal conditions that are
the effects of failure modes). Edges between nodes in the
graph capture the causality of failure propagation. The edge
labels capture the time interval and operating modes when
the failure propagation edge is active. Formally, a TFPG is
represented as a tuple $(F, D, E, M, ET, EM, DC, DS)$, where:

• $F$ is a nonempty set of failure nodes.

• $D$ is a nonempty set of discrepancy nodes.

• $E \subseteq V \times V$ is a set of edges connecting the set of all
nodes $V = F \cup D$.

• $M$ is a nonempty set of system modes. At each time
instance $t$ the system can be in only one mode.

• $ET : E \to I$ is a map that associates with every edge
in $E$ a time interval $[t_{min}, t_{max}] \in I$ that represents the
minimum ($t_{min}$) and maximum ($t_{max}$) time for failure
propagation over the edge.

• $EM : E \to \mathcal{P}(M)$ is a map that associates with every
edge in $E$ a set of modes in $M$ when the edge is active.
For any edge $e \in E$ that is not mode-dependent (i.e.
active in all modes), $EM(e) = \emptyset$.

• $DC : D \to \{AND, OR\}$ is a map defining the class of
each discrepancy as either an AND or an OR node. An OR
(AND) type discrepancy node will be activated when the
failure propagates to the node from any (all) of its
parents.

• $DS : D \to \{A, I\}$ is a map defining the monitoring
status of the discrepancy as either $A$ for the case when the
discrepancy is active (monitored by an online alarm) or
$I$ for the case when the discrepancy is inactive (not
monitored).
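
One possible in-memory representation of this tuple is sketched below. The class and field names are illustrative choices, not part of the TFPG definition; the empty `modes` set on an edge follows the convention above that a mode-independent edge has $EM(e) = \emptyset$.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    t_min: float                        # ET(e): minimum propagation time
    t_max: float                        # ET(e): maximum propagation time
    modes: frozenset = frozenset()      # EM(e); empty = active in all modes


@dataclass
class TFPG:
    failure_modes: set                  # F
    discrepancy_class: dict             # DC: discrepancy -> "AND" | "OR"
    monitored: dict                     # DS: discrepancy -> True (A) | False (I)
    modes: set                          # M
    edges: list = field(default_factory=list)   # E

    def add_edge(self, edge):
        """Add an edge after checking it connects nodes of V = F union D."""
        nodes = self.failure_modes | set(self.discrepancy_class)
        if edge.src not in nodes or edge.dst not in nodes:
            raise ValueError("edge endpoints must be failure or discrepancy nodes")
        if edge.t_min > edge.t_max:
            raise ValueError("t_min must not exceed t_max")
        self.edges.append(edge)

    def parents(self, node):
        """Nodes with an edge leading into the given node."""
        return [e.src for e in self.edges if e.dst == node]


graph = TFPG(
    failure_modes={"F1"},
    discrepancy_class={"D1": "OR", "D2": "AND"},
    monitored={"D1": True, "D2": False},
    modes={"A", "B"},
)
graph.add_edge(Edge("F1", "D1", 1, 3))
graph.add_edge(Edge("D1", "D2", 2, 5, frozenset({"A"})))
print(graph.parents("D2"))
```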

Figure 5 shows a graphical depiction of a failure
propagation graph model. Rectangles in the graph model represent
the failure modes while circles and squares represent OR and
AND type discrepancies, respectively. The edges between
the nodes represent failure propagation. Propagation edges
are parameterized with the corresponding interval, $[e.t_{min}, e.t_{max}]$,
and the set of modes at which the edge is active.
Figure 5 also shows a sequence of active discrepancies (alarm
signals) identified by shaded discrepancies. The time at which
the alarm is observed is shown above the corresponding
discrepancy. Dashed lines are used to distinguish inactive
propagation links.

The TFPG reasoning algorithm attempts to explain the
current observations (states of monitored discrepancy nodes) by
hypothesizing the faults that could have occurred in the
system. Each hypothesis assigns a hypothetical state to each
node in the graph. In the case of failure modes, an ON state
indicates that the failure is present; otherwise the state is OFF.
The state of a discrepancy node could be set to ON or OFF
depending on whether the failure effect has reached the node
or not. Alternatively, an UNKNOWN state indicates that there is
not enough information to determine whether the failure effect has
definitely reached the node.

The TFPG failure propagation semantics is used to identify
and update the hypothetical states of the TFPG nodes. For
an OR discrepancy $v'$ and an edge $e = (v, v') \in E$, once a
failure effect reaches $v$ at time $t$ it will reach $v'$ at a time $t'$
where $e.t_{min} \le t' - t \le e.t_{max}$. On the other hand, the
activation period of an AND discrepancy $v'$ is the
composition of the activation periods for each link $(v, v') \in E$. For
a failure to propagate through an edge $e = (v, v')$, the edge
should be active throughout the propagation, that is, from the
time the failure reaches $v$ to the time it reaches $v'$. An edge $e$
is active if and only if the current operation mode of the
system, $m_c$, is in the set of activation modes of the edge, that is,
$m_c \in EM(e)$. When a failure propagates to a monitored
discrepancy node (or alarm) $v'$ ($DS(v') = A$) its physical state
is considered to be ON; otherwise it is considered to be OFF.
If the link is deactivated at any time during the propagation
(because of mode switching), the propagation stops. Links are
assumed to be memoryless with respect to failure
propagation, so that current failure propagation is independent of any
(incomplete) previous propagation. Also, once a failure effect
reaches a node, its state will change permanently and will not
be affected by any future failure propagation.
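
The timing condition for a single edge can be checked as sketched below. The function names and argument layout are assumptions made for this illustration. The sketch treats an empty mode set as "active in all modes", following the convention $EM(e) = \emptyset$ for mode-independent edges, and it checks only one mode value rather than tracking mode switches over the whole propagation interval.

```python
def edge_active(edge_modes, current_mode):
    """An edge with no listed modes is treated as active in every mode."""
    return not edge_modes or current_mode in edge_modes


def propagation_consistent(t_parent, t_child, t_min, t_max,
                           edge_modes, current_mode):
    """Check t_min <= t_child - t_parent <= t_max on an active edge."""
    if not edge_active(edge_modes, current_mode):
        return False
    return t_min <= t_child - t_parent <= t_max


def or_activation_window(parent_arrivals):
    """Earliest and latest activation time of an OR discrepancy.

    parent_arrivals is a list of (t_parent, t_min, t_max), one entry per
    active incoming edge. The node activates as soon as any one parent's
    failure effect arrives.
    """
    earliest = min(t + t_min for t, t_min, _ in parent_arrivals)
    latest = min(t + t_max for t, _, t_max in parent_arrivals)
    return earliest, latest


print(propagation_consistent(2, 5, 1, 4, set(), "A"))     # delay 3 in [1, 4]
print(propagation_consistent(2, 7, 1, 4, set(), "A"))     # delay 5 is too long
print(propagation_consistent(2, 5, 1, 4, {"B"}, "A"))     # edge inactive in A
print(or_activation_window([(2, 1, 4), (3, 2, 6)]))
```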

While a detailed description of the TFPG diagnosis algorithm
may be found in (Abdelwahed et al., 2004, 2005), in the
interest of self-containment a brief description of the procedures
referenced in this paper is provided below.

• Consis(H, O_t): This procedure checks if the
hypothetical states of nodes as captured in the hypothesis H are
consistent with the observations O at time t.

• UpdateHypo(t, HS_{t-1}): This procedure takes as
input the current time, t, and the set of hypotheses at the
previous time-stamp, HS_{t-1}, and outputs an updated set
of hypotheses, HS_t, which includes any updates to the
state of the nodes based on the time elapsed.

• ExplainHypo(H, O_t): This procedure generates new
hypotheses to explain the current observations (O_t)
relative to an existing hypothesis H that explains the past
observations.




